A bank offers different kinds of deposits to its clients and wants to classify the clients in order to recommend a type of deposit suited to their needs.
The dataset has 48,678 observations and 15 variables. Each row represents a person living in California, and the columns record characteristics such as age. The last column, Income, indicates whether the individual earns more than 50K dollars per year.
| Age | Workclass | Fnlg | Education | Education_Num | Marital_status | Occupation | Relationship |
|---|---|---|---|---|---|---|---|
| 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child |
| 36 | Empl-gov | 212465 | Bachelors | 13 | Married | Adm-clerical | Husband |
| 25 | Private | 220931 | Bachelors | 13 | Never-married | Prof-specialty | Not-in-family |
| 22 | Private | 236427 | HS-grad | 9 | Never-married | Adm-clerical | Own-child |
| Race | Gender | Capital_Gain | Capital_Loss | Hours_Per_Week | Native_Country | Income |
|---|---|---|---|---|---|---|
| Black | Male | 0 | 0 | 40 | United-States | <=50K |
| White | Male | 0 | 0 | 40 | United-States | <=50K |
| White | Male | 0 | 0 | 43 | Peru | <=50K |
| White | Male | 0 | 0 | 20 | United-States | <=50K |
- Missing values
- Outlier detection
- Data transformation
We remove rows with missing values, since we have enough data.
| Variable | Missing values |
|---|---|
| Age | 0 |
| Workclass | 2799 |
| Fnlg | 0 |
| Education | 0 |
| Education_Num | 0 |
| Marital_status | 0 |
| Occupation | 2809 |
| Relationship | 0 |
| Race | 0 |
| Gender | 0 |
| Capital_Gain | 0 |
| Capital_Loss | 0 |
| Hours_Per_Week | 0 |
| Native_Country | 857 |
| Income | 0 |
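This step can be sketched as follows (a minimal pandas illustration on toy rows; the counts above come from the full dataset, and the report's own analysis appears to have been done in R):

```python
import pandas as pd

# Toy stand-in for the full dataset (48,678 rows in the report).
df = pd.DataFrame({
    "Age": [25, 36, 25, 22],
    "Workclass": ["Private", None, "Private", "Private"],
    "Occupation": ["Machine-op-inspct", "Adm-clerical", None, "Adm-clerical"],
    "Income": ["<=50K", "<=50K", "<=50K", "<=50K"],
})

# Missing values per column (the table above reports these counts).
missing_per_column = df.isna().sum()

# Drop incomplete rows; affordable because the dataset is large.
df_clean = df.dropna().reset_index(drop=True)
```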
We do not remove outliers because they are relevant to the problem.
\[ \text{Benefits} = \text{Capital\_Gain} - \text{Capital\_Loss} \]
| Capital_Gain | Capital_Loss | Benefits |
|---|---|---|
| 0 | 1721 | -1721 |
| 3103 | 0 | 3103 |
| 3674 | 0 | 3674 |
| 2174 | 0 | 2174 |
| 3411 | 0 | 3411 |
| 0 | 1721 | -1721 |
| 2907 | 0 | 2907 |
| 4386 | 0 | 4386 |
| 5013 | 0 | 5013 |
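The derived variable can be sketched as (illustrative pandas, not the report's original code):

```python
import pandas as pd

# A few illustrative rows; the real columns are Capital_Gain and Capital_Loss.
df = pd.DataFrame({"Capital_Gain": [0, 3103, 2174],
                   "Capital_Loss": [1721, 0, 0]})

# Benefits = Capital_Gain - Capital_Loss, matching the formula above.
df["Benefits"] = df["Capital_Gain"] - df["Capital_Loss"]
```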
We modify the labels of the categorical variable Marital_status.
Does the variable Marital_status help us predict Income?
We address the class imbalance problem using undersampling.
We split the data into three sets: train, validation, and test.
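Both steps can be sketched as follows (toy data; the 60/20/20 proportions are an assumption, since the report does not state the split sizes):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": range(10),
    "Income": ["<=50K"] * 7 + [">50K"] * 3,  # imbalanced toy labels
})

# Undersampling: keep as many majority rows as there are minority rows.
minority = df[df["Income"] == ">50K"]
majority = df[df["Income"] == "<=50K"].sample(n=len(minority), random_state=1)
balanced = pd.concat([majority, minority]).sample(frac=1, random_state=1)

# Train / validation / test split (assumed 60/20/20).
n = len(balanced)
train = balanced.iloc[: int(0.6 * n)]
validation = balanced.iloc[int(0.6 * n): int(0.8 * n)]
test = balanced.iloc[int(0.8 * n):]
```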
Classification techniques:

- Logistic regression
- Linear and quadratic discriminant analysis
- K-nearest neighbors
- Random forest
- Boosting
- Support vector machines
- Naive Bayes classifier
We apply stepwise selection using all variables.
\[ \text{Income} \sim \, . \]
We decide to use 5 variables to train our models, because adding more variables does not significantly increase model accuracy.
\[ \text{Income} \sim \text{Marital\_status} + \text{Education\_Num} + \text{Age} + \text{Hours\_Per\_Week} + \text{Benefits} \]
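A scikit-learn analogue of the stepwise search (synthetic data; the report's selection was presumably done with an R stepwise/best-subsets procedure, so this is only a sketch):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 9 candidate predictors, a few of them informative.
X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           random_state=0)

# Forward selection down to 5 predictors, scored by cross-validated accuracy.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
)
selector.fit(X, y)
chosen = selector.get_support()  # boolean mask over the 9 candidates
```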
[Best-subsets selection table: the chosen model of each size 1-5 over the dummy-encoded candidates Age, Workclass, Education_Num, Marital_status, Hours_Per_Week, Relationship, Race, Gender, and Benefits; the inclusion marks were not recoverable.]
\[ \text{Income} \sim \text{Marital\_status} + \text{Education\_Num} + \text{Age} + \text{Hours\_Per\_Week} + \text{Benefits} \]
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 6942 | 1502 |
| >50K | 1980 | 6911 |
| Metric | Value |
|---|---|
| Accuracy | 0.7991347 |
| Sensitivity | 0.7780767 |
| Specificity | 0.8214668 |
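The reported statistics follow directly from the matrix, taking <=50K as the positive class (rows are predicted labels, columns actual ones):

```python
# Confusion matrix from the table above.
#                 actual <=50K  actual >50K
tp, fp = 6942, 1502   # predicted <=50K
fn, tn = 1980, 6911   # predicted >50K

accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)   # actual <=50K correctly recovered
specificity = tn / (tn + fp)   # actual >50K correctly recovered
```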
Coefficients and significance of the variables
| | Estimate | Std. Error | z value | Pr(>\|z\|) | Significant |
|---|---|---|---|---|---|
| (Intercept) | -8.4274 | 0.1798 | -46.8794 | 0.0000 | TRUE |
| Education_Num | 0.3828 | 0.0094 | 40.8500 | 0.0000 | TRUE |
| Age | 0.0300 | 0.0019 | 15.7987 | 0.0000 | TRUE |
| Marital_statusMarried | 2.2580 | 0.0678 | 33.2836 | 0.0000 | TRUE |
| Marital_statusNever-married | -0.3662 | 0.0854 | -4.2907 | 0.0000 | TRUE |
| Marital_statusSeparated | -0.0815 | 0.1607 | -0.5070 | 0.6121 | FALSE |
| Marital_statusWidowed | -0.1260 | 0.1591 | -0.7917 | 0.4285 | FALSE |
| Hours_Per_Week | 0.0365 | 0.0020 | 18.5478 | 0.0000 | TRUE |
| Benefits | 0.0002 | 0.0000 | 21.9089 | 0.0000 | TRUE |
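The Significant column is just the usual 5% test applied to the reported p-values; as a sketch:

```python
# p-values copied from the coefficient table above.
p_values = {
    "(Intercept)": 0.0000,
    "Education_Num": 0.0000,
    "Age": 0.0000,
    "Marital_statusMarried": 0.0000,
    "Marital_statusNever-married": 0.0000,
    "Marital_statusSeparated": 0.6121,
    "Marital_statusWidowed": 0.4285,
    "Hours_Per_Week": 0.0000,
    "Benefits": 0.0000,
}

# A coefficient is flagged significant when p < 0.05.
significant = {name: p < 0.05 for name, p in p_values.items()}
```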
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2299 | 479 |
| >50K | 616 | 2316 |
| Metric | Value |
|---|---|
| Accuracy | 0.8082312 |
| Sensitivity | 0.7886792 |
| Specificity | 0.8286225 |
| Accuracy | Sensitivity | Specificity | threshold |
|---|---|---|---|
| 0.8007005 | 0.7008576 | 0.9048301 | 0.4000000 |
| 0.8043783 | 0.7413379 | 0.8701252 | 0.4500000 |
| 0.8070053 | 0.7835334 | 0.8314848 | 0.4941281 |
| 0.8082312 | 0.7886792 | 0.8286225 | 0.5000000 |
| 0.8010508 | 0.8102916 | 0.7914132 | 0.5300000 |
| 0.7945709 | 0.8370497 | 0.7502683 | 0.5700000 |
| 0.7781086 | 0.8679245 | 0.6844365 | 0.6200000 |
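Threshold tables like the one above come from sweeping the decision cutoff over the predicted probabilities; a sketch on synthetic scores (raising the threshold always trades sensitivity for specificity):

```python
import numpy as np

# Synthetic validation scores and labels standing in for the model's output.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=1000), 0.0, 1.0)

def metrics_at(threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

rows = [metrics_at(t) for t in (0.40, 0.45, 0.50, 0.57, 0.62)]
```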
Model comparison
| Number of variables | Residual Df | Residual deviance |
|---|---|---|
| 1 | 17330 | 19065.23 |
| 2 | 17329 | 16400.06 |
| 3 | 17328 | 16117.06 |
| 4 | 17327 | 15692.77 |
| 5 | 17326 | 14771.59 |
| 6 | 17313 | 14227.68 |
| 7 | 17320 | 14504.95 |
| 8 | 17307 | 13978.14 |
| 9 | 17307 | 13978.14 |
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 6475 | 1226 |
| >50K | 2447 | 7187 |
| Metric | Value |
|---|---|
| Accuracy | 0.7881165 |
| Sensitivity | 0.7257341 |
| Specificity | 0.8542731 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2114 | 377 |
| >50K | 801 | 2418 |
| Metric | Value |
|---|---|
| Accuracy | 0.7936953 |
| Sensitivity | 0.7252144 |
| Specificity | 0.8651163 |
| Accuracy | Sensitivity | Specificity | threshold |
|---|---|---|---|
| 0.7879159 | 0.6826758 | 0.8976744 | 0.4000000 |
| 0.7894921 | 0.7008576 | 0.8819320 | 0.4500000 |
| 0.7936953 | 0.7252144 | 0.8651163 | 0.5000000 |
| 0.7966725 | 0.7488851 | 0.8465116 | 0.5375239 |
| 0.8000000 | 0.7825043 | 0.8182469 | 0.5700000 |
| 0.7898424 | 0.8202401 | 0.7581395 | 0.6200000 |
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 6880 | 1374 |
| >50K | 2042 | 7039 |
| Metric | Value |
|---|---|
| Accuracy | 0.8029420 |
| Sensitivity | 0.7711275 |
| Specificity | 0.8366813 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2269 | 454 |
| >50K | 646 | 2341 |
| Metric | Value |
|---|---|
| Accuracy | 0.8073555 |
| Sensitivity | 0.7783877 |
| Specificity | 0.8375671 |
| Accuracy | Sensitivity | Specificity | threshold |
|---|---|---|---|
| 0.7959720 | 0.7200686 | 0.8751342 | 0.4000000 |
| 0.8015762 | 0.7440823 | 0.8615385 | 0.4546504 |
| 0.8073555 | 0.7783877 | 0.8375671 | 0.5000000 |
| 0.8078809 | 0.7927959 | 0.8236136 | 0.5200000 |
| 0.7996497 | 0.8373928 | 0.7602862 | 0.5700000 |
| 0.7831874 | 0.8809605 | 0.6812165 | 0.6200000 |
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 6909 | 1127 |
| >50K | 2013 | 7286 |
| Metric | Value |
|---|---|
| Accuracy | 0.8188636 |
| Sensitivity | 0.7743779 |
| Specificity | 0.8660407 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2170 | 425 |
| >50K | 745 | 2370 |
| Metric | Value |
|---|---|
| Accuracy | 0.7950963 |
| Sensitivity | 0.7444254 |
| Specificity | 0.8479428 |
| Accuracy | Sensitivity | Specificity | Neighbors |
|---|---|---|---|
| 0.7866900 | 0.7777015 | 0.7960644 | 1 |
| 0.7903678 | 0.7927959 | 0.7878354 | 2 |
| 0.7938704 | 0.7687822 | 0.8200358 | 3 |
| 0.7919440 | 0.7698113 | 0.8150268 | 4 |
| 0.7907180 | 0.7516295 | 0.8314848 | 5 |
| 0.7924694 | 0.7567753 | 0.8296959 | 6 |
| 0.7926445 | 0.7478559 | 0.8393560 | 7 |
| 0.7912434 | 0.7499142 | 0.8343470 | 8 |
| 0.7915937 | 0.7437393 | 0.8415027 | 9 |
| 0.7928196 | 0.7481990 | 0.8393560 | 10 |
| 0.7950963 | 0.7444254 | 0.8479428 | 11 |
| 0.7945709 | 0.7475129 | 0.8436494 | 12 |
| 0.7947461 | 0.7409949 | 0.8508050 | 13 |
| 0.7935201 | 0.7447684 | 0.8443649 | 14 |
| 0.7917688 | 0.7413379 | 0.8443649 | 15 |
| 0.7924694 | 0.7440823 | 0.8429338 | 16 |
| 0.7908932 | 0.7399657 | 0.8440072 | 17 |
| 0.7903678 | 0.7399657 | 0.8429338 | 18 |
| 0.7903678 | 0.7385935 | 0.8443649 | 19 |
| 0.7894921 | 0.7396226 | 0.8415027 | 20 |
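The neighbor sweep can be sketched with scikit-learn (synthetic data; the report evaluates k = 1..20 on a validation set, where the best accuracy in the table above occurs at k = 11):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded five-variable data.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Validation accuracy for k = 1..20, as in the table above.
accuracy_by_k = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    for k in range(1, 21)
}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
```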
| Accuracy | Sensitivity | Specificity | threshold |
|---|---|---|---|
| 0.7851138 | 0.6627787 | 0.9127013 | 0.4000000 |
| 0.7938704 | 0.6974271 | 0.8944544 | 0.4500000 |
| 0.7970228 | 0.7276158 | 0.8694097 | 0.4902072 |
| 0.7950963 | 0.7444254 | 0.8479428 | 0.5000000 |
| 0.7905429 | 0.7807890 | 0.8007156 | 0.5500000 |
| 0.7851138 | 0.8312178 | 0.7370304 | 0.6200000 |
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 6890 | 1315 |
| >50K | 2022 | 7095 |
| Metric | Value |
|---|---|
| Accuracy | 0.8073548 |
| Sensitivity | 0.7731149 |
| Specificity | 0.8436385 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2304 | 380 |
| >50K | 611 | 2415 |
| Metric | Value |
|---|---|
| Accuracy | 0.8264448 |
| Sensitivity | 0.7903945 |
| Specificity | 0.8640429 |
| Accuracy | Sensitivity | Specificity | threshold |
|---|---|---|---|
| 0.8239930 | 0.7554031 | 0.8955277 | 0.40000 |
| 0.8246935 | 0.7749571 | 0.8765653 | 0.45000 |
| 0.8271454 | 0.7962264 | 0.8593918 | 0.50000 |
| 0.8271454 | 0.7962264 | 0.8593918 | 0.53125 |
| 0.8271454 | 0.7962264 | 0.8593918 | 0.55000 |
| 0.8169877 | 0.8267581 | 0.8067979 | 0.65000 |
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 7133 | 1408 |
| >50K | 1789 | 7005 |
| Metric | Value |
|---|---|
| Accuracy | 0.8155754 |
| Sensitivity | 0.7994844 |
| Specificity | 0.8326400 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2313 | 398 |
| >50K | 602 | 2397 |
| Metric | Value |
|---|---|
| Accuracy | 0.8248687 |
| Sensitivity | 0.7934820 |
| Specificity | 0.8576029 |
Efficient number of iterations
| Accuracy | Sensitivity | Specificity | threshold |
|---|---|---|---|
| 0.8112084 | 0.6939966 | 0.9334526 | 0.4000000 |
| 0.8199650 | 0.7396226 | 0.9037567 | 0.4500000 |
| 0.8278459 | 0.7886792 | 0.8686941 | 0.4912833 |
| 0.8248687 | 0.7934820 | 0.8576029 | 0.5000000 |
| 0.7994746 | 0.8888508 | 0.7062612 | 0.5300000 |
| 0.7530648 | 0.9416810 | 0.5563506 | 0.5700000 |
| 0.6604203 | 0.9938250 | 0.3127013 | 0.6200000 |
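For boosting, an efficient number of iterations can be read off staged validation scores; a scikit-learn sketch (the report's boosting implementation is not specified, so GradientBoostingClassifier on synthetic data is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

# Validation accuracy after each boosting iteration; the "efficient" number
# of iterations is where the score peaks.
staged_accuracy = [float(np.mean(pred == y_val))
                   for pred in model.staged_predict(X_val)]
efficient_n = int(np.argmax(staged_accuracy)) + 1
```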
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 6221 | 1061 |
| >50K | 2752 | 7375 |
| Metric | Value |
|---|---|
| Accuracy | 0.7809754 |
| Sensitivity | 0.6933021 |
| Specificity | 0.8742295 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2047 | 343 |
| >50K | 912 | 2429 |
| Metric | Value |
|---|---|
| Accuracy | 0.7810155 |
| Sensitivity | 0.6917878 |
| Specificity | 0.8762626 |
Cost parameter tuning
| cost | error | dispersion |
|---|---|---|
| 0.001 | 0.2586014 | 0.0109569 |
| 0.010 | 0.2441260 | 0.0089190 |
| 0.100 | 0.2321784 | 0.0087825 |
| 1.000 | 0.2213217 | 0.0097484 |
| 5.000 | 0.2197711 | 0.0088204 |
| 10.000 | 0.2197137 | 0.0086724 |
| 25.000 | 0.2190245 | 0.0099011 |
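A cross-validated sweep over the same cost grid can be sketched with scikit-learn's SVC (synthetic data; the table above appears to be R tuning output, so this is only an analogue):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)

# 5-fold cross-validation over the cost grid from the table above.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 5, 10, 25]},
                    cv=5)
grid.fit(X, y)
best_cost = grid.best_params_["C"]
```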
Training confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 8645 | 339 |
| >50K | 5651 | 2803 |
| Metric | Value |
|---|---|
| Accuracy | 0.7792178 |
| Sensitivity | 0.7265138 |
| Specificity | 0.8352259 |
Testing confusion matrix and statistics
| | <=50K | >50K |
|---|---|---|
| <=50K | 2117 | 453 |
| >50K | 828 | 2301 |
| Metric | Value |
|---|---|
| Accuracy | 0.7752237 |
| Sensitivity | 0.7188455 |
| Specificity | 0.8355120 |
| usekernel | fL | adjust | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|---|---|
| TRUE | 0 | 2 | 0.7778983 | 0.5570333 | 0.0064239 | 0.0128679 |
| TRUE | 1 | 2 | 0.7778983 | 0.5570333 | 0.0064239 | 0.0128679 |
| TRUE | 2 | 2 | 0.7778983 | 0.5570333 | 0.0064239 | 0.0128679 |
| TRUE | 3 | 2 | 0.7778983 | 0.5570333 | 0.0064239 | 0.0128679 |
| TRUE | 4 | 2 | 0.7778983 | 0.5570333 | 0.0064239 | 0.0128679 |
| TRUE | 5 | 2 | 0.7778983 | 0.5570333 | 0.0064239 | 0.0128679 |
ROC curves
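Each model's ROC curve is obtained by sweeping the threshold over its predicted probabilities; a sketch with synthetic scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic scores standing in for one model's predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=500), 0.0, 1.0)

# False/true positive rates at every threshold, plus the area under the curve;
# plotting one (fpr, tpr) curve per model gives a visual comparison.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
```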
Boosting is the most appropriate technique for this classification problem.
The most influential variables are Marital_status, Education_Num, Benefits…
Now the company wants to analyze customers who participate in the financial market, and to recommend different financial products to its clients according to their needs.
We do not remove outliers because they are relevant to the problem.
We want to analyze only customers who participate in the financial market, that is, clients who have gained or lost capital. We filter the dataset with this restriction.
\[ \text{Benefits} = \text{Capital\_Gain} - \text{Capital\_Loss} \neq 0 \]
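The filter can be sketched in pandas (illustrative rows):

```python
import pandas as pd

df = pd.DataFrame({"Capital_Gain": [0, 3103, 0, 2174],
                   "Capital_Loss": [0, 0, 1721, 0]})
df["Benefits"] = df["Capital_Gain"] - df["Capital_Loss"]

# Keep only clients active in the financial market: Benefits != 0.
market = df[df["Benefits"] != 0].reset_index(drop=True)
```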
Best model using 5 variables
\[ \text{Income} \sim \text{Marital\_status} + \text{Education\_Num} + \text{Relationship} + \text{Occupation} + \text{Benefits} \]
What are the most influential variables now?